Abstract. Vector quantization (VQ) is a lossy data compression technique
from signal processing, which is restricted to feature vectors and
therefore inapplicable for combinatorial structures. This contribution 
presents a theoretical foundation of graph quantization (GQ) that ex- 
tends VQ to the domain of attributed graphs. We present the necessary 
Lloyd-Max conditions for optimality of a graph quantizer and consistency 
results for optimal GQ design based on empirical distortion measures and 
stochastic optimization. These results statistically justify existing clustering
algorithms in the domain of graphs. The proposed approach provides
a template of how to link structural pattern recognition methods
other than GQ to statistical pattern recognition. 



1 Introduction 

Vector quantization is a classical technique from signal processing suitable for 
lossy data compression, density estimation, and prototype-based clustering [7, 
14,30]. The problem of optimal vector quantizer design is to find a codebook 
consisting of a finite set of prototypes such that an expected distortion with 
respect to some (differentiable) distortion measure is minimized. 

Since the probability distribution of the input patterns is usually unknown, 
vector quantizer design techniques use empirical data. Extensively studied design 
techniques are, for example, k-means and simple competitive learning. The
k-means algorithm is also commonly referred to as the Linde-Buzo-Gray (LBG)
algorithm [24] or the generalized Lloyd algorithm [25]. This algorithm is a local
optimizer of the empirical sum-of-squared-error distortion without any global
optimality or consistency guarantees. In contrast to k-means, competitive learning
directly minimizes the expected distortion and is a consistent learner under very 
general conditions in the sense that it almost surely converges to a local optimal 
solution of the expected distortion. 

One limitation of VQ is its restriction to patterns that are represented by 
vectors. For patterns that are more naturally represented by finite combinatorial 
structures, the theoretical framework of VQ as well as its design techniques are
no longer applicable. Examples of such structures include point
patterns, strings, trees, and graphs arising from diverse application areas like
proteomics, chemoinformatics, and computer vision.

To overcome this limitation, we generalize vector quantization to quantization
of graphs. A number of graph quantizer design techniques for the purpose of
prototype-based clustering have already been proposed. Examples include
competitive learning algorithms in the domain of graphs [16-20, 22] and k-means as
well as k-medoids algorithms [12, 13, 19, 20, 23, 28, 29]. Related clustering methods
are presented in [3, 26, 31]. Due to the lack of an appropriate theoretical
framework, all these graph quantizer design techniques (or clustering methods) have
been developed in order to minimize an empirical distortion function without
justifying whether the solutions found are statistically consistent estimators of
the true but unknown solutions. In addition, it is unclear whether the nearest
neighbor and centroid conditions, which are also referred to as the Lloyd-Max
conditions, are necessary conditions for optimality.

In this contribution, we propose graph quantization in a mathematically
principled way as an extension of vector quantization, where we consider the graph
edit distance as the underlying graph distortion measure. The key results of this
contribution are consistency statements for estimators based on empirical
distortion measures and estimators based on stochastic optimization. Furthermore,
we prove that the Lloyd-Max conditions are also necessary conditions for optimal
graph quantizers. In order to achieve the consistency results and the Lloyd-Max
conditions, we isometrically embed - without loss of structural information -
graphs as points into some Riemannian orbifold. An orbifold is the quotient of a
manifold by a finite group action and therefore generalizes the notion of
manifold. Using orbifolds we can define geometric and analytic concepts such as length,
angle, derivative, gradient, and integral locally to a Euclidean space. This
construction forms the basis for extending consistency results from Euclidean vector
spaces to the domain of graphs.

The proposed approach has the following properties: First, it can be applied 
to finite combinatorial structures other than graphs like, for example, point pat- 
terns, sequences, trees, and hypergraphs. For the sake of concreteness, we restrict 
our attention exclusively to the domain of graphs. Second, for graphs consisting 
of a single vertex with feature vectors as attributes, graph quantization coincides 
with vector quantization. Third, the proposed consistency results justify some of 
the above referenced graph clustering methods as statistically consistent learn- 
ers. Fourth, the underlying mathematical framework can be applied in order to 
link other structural pattern recognition methods that directly operate in the 
domain of graphs to methods from statistical pattern recognition. 

The paper is organized as follows. Section 2 describes the problem of graph
quantizer design. Section 3 introduces Riemannian orbifolds. In Section 4, we
extend VQ to GQ and present consistency results for GQ design techniques.
Section 5 briefly discusses the case of general graph edit distance functions. 
Finally, Section 6 concludes. 

2 The Problem of Graph Quantizer Design 

This section aims at outlining the problem of extending VQ to the quantization 
of graphs. 



2.1 Attributed Graphs 

To begin with, we describe the structures we want to quantize.

Let $\mathbb{A}$ be a set of attributes and let $\varepsilon \in \mathbb{A}$ be a distinguished element denoting
the null or void element. An attributed graph is a tuple $X = (V, \alpha)$ consisting of
a finite nonempty set $V$ of vertices and an attribute function $\alpha : V \times V \to \mathbb{A}$.
Elements of the set

$$E = \{(i,j) \in V \times V : i \neq j \text{ and } \alpha(i,j) \neq \varepsilon\}$$

are the edges of $X$. By $\mathcal{G}_{\mathbb{A}}$ we denote the set of all attributed graphs with
attributes from $\mathbb{A}$. The vertex set of an attributed graph $X$ is often referred to
as $V_X$ and its attribute function as $\alpha_X$.

An alignment of a graph $X$ is a graph $X'$ with $V_X \subseteq V_{X'}$ and

$$\alpha_{X'}(i,j) = \begin{cases} \alpha_X(i,j) & : (i,j) \in V_X \times V_X \\ \varepsilon & : \text{otherwise} \end{cases}$$

for all $i,j \in V_{X'}$. Thus, we obtain an alignment of $X$ by adding isolated vertices
with null attribute. The set $V_{X'} \setminus V_X$ is the set of aligned vertices. By $\mathcal{A}(X)$ we
denote the (infinite) set of all alignments of $X$.

A pairwise alignment of graphs $X$ and $Y$ is a triple $(\phi, X', Y')$ consisting of
alignments $X' \in \mathcal{A}(X)$ and $Y' \in \mathcal{A}(Y)$ together with a bijective mapping

$$\phi : V_{X'} \to V_{Y'}, \quad i \mapsto i^\phi.$$

By $\mathcal{A}(X, Y)$ we denote the set of all pairwise alignments between $X$ and $Y$.
Sometimes we briefly write $\phi$ instead of $(\phi, X', Y')$.



2.2 The Graph Edit Distance 

Fundamental for quantizing data is the notion of distortion. This section briefly
introduces graph edit distance functions as our choice of distortion measure.
For a more detailed definition of the graph edit distance, we refer to [2]. In
addition, we present an important graph metric based on a generalization of
the concept of maximum common subgraph, which arises in various
guises as a common choice of proximity measure [1,5,6,15,32,33]. For the sake of
convenience, we assume that all distances are metrics.



Each pairwise alignment $(\phi, X', Y') \in \mathcal{A}(X,Y)$ can be regarded as an edit
path with cost

$$d_\phi(X,Y) = \sum_{i,j \in V_{X'}} d_{\mathbb{A}}\big(\alpha_{X'}(i,j), \alpha_{Y'}(i^\phi, j^\phi)\big),$$

where $d_{\mathbb{A}} : \mathbb{A} \times \mathbb{A} \to \mathbb{R}_+$ is a distance function defined on the set $\mathbb{A}$ of attributes.
Observe that deletion (insertion) of vertices also deletes (inserts) all edges the
respective vertices are incident to.



The graph edit distance of $X$ and $Y$ is then defined as the cost of an edit path with
minimal cost

$$d(X, Y) = \min \{ d_\phi(X, Y) : \phi \in \mathcal{A}(X, Y) \}.$$

Note that the set $\mathcal{A}(X, Y)$ of pairwise alignments is of infinite cardinality. But
since $d_{\mathbb{A}}(\varepsilon, \varepsilon) = 0$, we actually take the minimum over a finite subset by ignoring
all pairwise alignments that map aligned vertices with null attributes onto each
other.

Next, we consider an important example of the graph edit distance based on 
a generalization of the concept of maximum common subgraph. We derive this 
graph metric from a similarity measure in the same way the Euclidean distance 
is derived from an inner product. 

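Suppose that $k_{\mathbb{A}} : \mathbb{A} \times \mathbb{A} \to \mathbb{R}$ with $k_{\mathbb{A}}(\varepsilon, \varepsilon) = 0$ is a positive definite kernel.
We measure the quality of a pairwise alignment $\phi \in \mathcal{A}(X, Y)$ by

$$k_\phi(X,Y) = \sum_{i,j \in V_{X'}} k_{\mathbb{A}}\big(\alpha_{X'}(i,j), \alpha_{Y'}(i^\phi, j^\phi)\big).$$

An optimal alignment kernel is a graph similarity measure of the form

$$k(X, Y) = \max \{ k_\phi(X, Y) : \phi \in \mathcal{A}(X, Y) \}. \qquad (1)$$

Note that $k$ in (1) is symmetric but in general indefinite, since it is a pointwise
maximum of a set of positive definite kernels.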

The distance metric on $\mathcal{G}_{\mathbb{A}}$ induced by an optimal alignment kernel $k(\cdot|\cdot)$ is
defined by

$$d(X, Y) = \sqrt{l(X)^2 - 2k(X,Y) + l(Y)^2}, \qquad (2)$$

where $l(X) = \sqrt{k(X, X)}$ denotes the length of an attributed graph $X$. As shown
in [23], $d$ is indeed a metric and can be expressed as a graph edit distance.
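To make definitions (1) and (2) concrete, the following Python sketch evaluates an optimal alignment kernel and the induced metric by brute force. It rests on simplifying assumptions that are not part of the text: attributes are real numbers with null attribute 0, the attribute kernel is the ordinary product, graphs are given as square matrices of attributes, and the helper names `pad`, `alignment_kernel`, and `alignment_distance` are hypothetical. Enumerating all vertex bijections is feasible only for very small graphs.

```python
from itertools import permutations
from math import sqrt

def pad(M, n):
    """Align a graph to order n by adding isolated vertices with null attribute 0."""
    m = len(M)
    return [[M[i][j] if i < m and j < m else 0.0 for j in range(n)] for i in range(n)]

def alignment_kernel(X, Y):
    """k(X, Y): maximum over vertex bijections of the summed attribute products."""
    n = max(len(X), len(Y))
    Xp, Yp = pad(X, n), pad(Y, n)
    return max(
        sum(Xp[i][j] * Yp[p[i]][p[j]] for i in range(n) for j in range(n))
        for p in permutations(range(n))
    )

def alignment_distance(X, Y):
    """d(X, Y) = sqrt(l(X)^2 - 2 k(X, Y) + l(Y)^2) with l(X)^2 = k(X, X)."""
    sq = alignment_kernel(X, X) - 2.0 * alignment_kernel(X, Y) + alignment_kernel(Y, Y)
    return sqrt(max(sq, 0.0))  # guard against tiny negative rounding errors

# Edges carry attribute 1, vertices and non-edges carry the null attribute 0.
# The sum runs over ordered pairs, so each undirected edge appears twice.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
relabeled = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]   # the same path, other vertex order
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

print(alignment_distance(path, relabeled))   # 0.0: both matrices represent one graph
print(alignment_distance(path, triangle))    # sqrt(2)
```

Because the maximum runs over all vertex bijections, the value does not depend on the order in which the vertices of either graph are listed.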

2.3 The Problem of Graph Quantizer Design 

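Let $(\mathcal{G}_{\mathbb{A}}, d)$ be a graph distance space, where $d(\cdot|\cdot)$ is a graph edit distance.

Optimal graph quantizer design aims at minimizing the expected distortion

$$D(C) = \int_{\mathcal{G}_{\mathbb{A}}} d(X, Q(X)) \, dP(X),$$

where $Q : \mathcal{G}_{\mathbb{A}} \to C$ is a graph quantizer, $C = \{Y_1, \ldots, Y_k\}$ a codebook consisting
of $k$ code graphs, and $P = P_{\mathcal{G}_{\mathbb{A}}}$ is a probability measure defined on some
appropriate measurable space $(\mathcal{G}_{\mathbb{A}}, \Sigma_{\mathcal{G}_{\mathbb{A}}})$.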

As opposed to vector quantization, the following factors complicate designing 
an optimal graph quantizer in a statistically consistent way: 

1. The graph distance $d(X,Y)$ is in general non-convex and non-differentiable.

2. Neither a well-defined addition on graphs nor the notion of derivative for 
functions on graphs is known. 



To overcome these difficulties, we isometrically embed graphs as points into a 
Riemannian orbifold in order to apply methods that generalize gradient descent 
techniques and methods from stochastic optimization for non-convex and non- 
differentiable distortion functions. 

3 Riemannian Orbifolds 

Orbifolds generalize the notion of manifold as locally being a quotient of $\mathbb{R}^n$ by
finite group actions. Consequently, learning on orbifolds generalizes learning on
Euclidean spaces and Riemannian manifolds. This section introduces Riemannian
orbifolds and their intrinsic metric structure. Proofs for new results are
delegated to Section B.1. For all other proofs we refer to [4, 21].

3.1 Riemannian Orbifolds 

To keep the treatment simple, we assume that $\mathcal{X} = \mathbb{R}^n$ is the $n$-dimensional
Euclidean vector space, and $\Gamma$ is a permutation group acting on $\mathcal{X}$. In a more
general setting, however, we can assume that $\mathcal{X}$ is a Riemannian manifold, and
$\Gamma$ is a finite group of isometries acting effectively on $\mathcal{X}$.
The binary operation

$$\cdot : \Gamma \times \mathcal{X} \to \mathcal{X}, \quad (\gamma, x) \mapsto \gamma(x)$$

is a group action of $\Gamma$ on $\mathcal{X}$. For $x \in \mathcal{X}$, the orbit of $x$ is the set defined by

$$[x] = \{\gamma(x) : \gamma \in \Gamma\}.$$

The quotient set

$$\mathcal{X}_\Gamma = \mathcal{X}/\Gamma = \{[x] : x \in \mathcal{X}\}$$

consisting of all orbits carries the structure of a Riemannian orbifold. Its
orbifold chart is the surjective continuous mapping

$$\pi : \mathcal{X} \to \mathcal{X}_\Gamma, \quad x \mapsto [x]$$

that projects each point $x$ to its orbit $[x]$.

In the following, an orbifold is a triple $\mathcal{Q} = (\mathcal{X}, \Gamma, \pi)$ consisting of a
Euclidean space $\mathcal{X}$, a permutation group $\Gamma$ acting on $\mathcal{X}$, and its orbifold chart $\pi$.
With $\Gamma = \{\mathrm{id}\}$ being the trivial permutation group consisting of the identity
only, a manifold $\mathcal{X}$ is also an orbifold. In general, however, the underlying space
$\mathcal{X}_\Gamma$ of an orbifold is not a manifold. Thus, orbifolds generalize the notion of
manifold. The points at which an orbifold $\mathcal{X}_\Gamma$ is locally not homeomorphic to
a manifold are its singular points. We call the elements of $\mathcal{X}_\Gamma$ structures, since
they represent combinatorial structures like attributed graphs. We use capital
letters $X, Y, Z, \ldots$ to denote structures from $\mathcal{X}_\Gamma$ and write, by abuse of notation,
$x \in X$ if $\pi(x) = X$. Each vector $x \in X$ is a vector representation of structure
$X$ and the set $\mathcal{X}$ of all vector representations is the representation space of $\mathcal{X}_\Gamma$.



Example 1. Let $\mathcal{X} = \mathbb{R}^2$ and let $\Gamma$ be the group generated by the reflection across
the main diagonal of the x-y-plane. Then $\mathcal{Q} = (\mathcal{X}, \Gamma, \pi)$ is a Riemannian orbifold
with

$$\pi : \mathcal{X} \to \mathcal{X}_\Gamma, \quad x = (x_1, x_2) \mapsto [x] = \{(x_1, x_2), (x_2, x_1)\}.$$

The singular points of $\mathcal{X}_\Gamma$ are all structures $X$ represented by vectors $x = (x_1, x_2)$
with $x_1 = x_2$.

3.2 The Riemannian Orbifold of Attributed Graphs

In this section, we show that attributed graphs can be identified with points in
some Riemannian orbifold.

Riemannian orbifolds of attributed graphs arise by considering equivalence
classes of matrices representing the same graph. To identify graphs with points
in a Riemannian orbifold without loss of structural information, some technical
assumptions and restrictions to simplify the mathematical treatment are necessary.
For this, let $(\mathcal{G}_{\mathbb{A}}, d)$ be a graph distance space with graph edit distance
$d(\cdot|\cdot)$. Then we make the following assumptions:

P1 There is a feature map $\Phi : \mathbb{A} \to \mathcal{H}$ of the attributes into some finite dimensional
Euclidean feature space $\mathcal{H}$ and a distance function $d_{\mathcal{H}} : \mathcal{H} \times \mathcal{H} \to \mathbb{R}_+$
such that $\Phi(\varepsilon) = 0$ and

$$d_{\mathbb{A}}(a, a') = d_{\mathcal{H}}(\Phi(a), \Phi(a'))$$

for all attributes $a, a' \in \mathbb{A}$.

P2 All graphs are finite of bounded order $n$, where $n$ is a sufficiently large
number. Graphs $X$ of order less than $n$, say $m < n$, are aligned to graphs
$X'$ of order $n$ by inserting $p = n - m$ isolated vertices with null attribute $\varepsilon$.

Before discussing the impact of both assumptions for practical application, we
first restate our first assumption for graph metrics induced by optimal alignment
kernels. By definition, $k_{\mathbb{A}} : \mathbb{A} \times \mathbb{A} \to \mathbb{R}$ is a positive definite kernel corresponding
to an inner product $k_{\mathbb{A}}(x,y) = \Phi(x)^T \Phi(y)$ in some feature space $\mathcal{H}$. Our first
assumption requires that $\mathcal{H}$ is a finite dimensional Euclidean space and $\Phi(\varepsilon) = 0$.

Now let us consider the above assumptions in more detail. Neither condition
affects the graph edit distance, provided an appropriate feature map for
the attributes can be found. Restricting to finite dimensional Euclidean feature
spaces $\mathcal{H}$ is necessary for deriving consistency results and for applying methods
from stochastic optimization. Limiting the maximum size of the graphs to
some arbitrarily large number $n$ and aligning smaller graphs to graphs of order
$n$ are purely technical assumptions to simplify the mathematics. For machine learning
problems, this limitation should have no practical impact, because neither
the bound $n$ needs to be specified explicitly nor an extension of all graphs to an
identical order needs to be performed. When applying the theory, all we actually
require is that the order of the graphs is bounded.



With both assumptions in mind, we construct the Riemannian orbifold of
attributed graphs. Let $\mathcal{X} = \mathcal{H}^{n \times n}$ be the set of all $(n \times n)$-matrices with elements
from feature space $\mathcal{H}$. A graph $X$ is completely specified by a representation
matrix $\mathbf{X} = (x_{ij})$ from $\mathcal{X}$ with elements

$$x_{ij} = \begin{cases} \Phi(\alpha_X(i,i)) & : i = j \\ \Phi(\alpha_X(i,j)) & : (i,j) \in E \\ 0 & : \text{otherwise} \end{cases}$$

for all $i,j \in V_X$. The form of a representation matrix $\mathbf{X}$ of $X$ is generally not
unique and depends on how the vertices are arranged in the diagonal of $\mathbf{X}$.

Now let $\Pi^n$ be the set of all $(n \times n)$-permutation matrices. For
each $P \in \Pi^n$ we define a mapping

$$\gamma_P : \mathcal{X} \to \mathcal{X}, \quad \mathbf{X} \mapsto P^T \mathbf{X} P.$$

Then $\Gamma = \{\gamma_P : P \in \Pi^n\}$ is a permutation group acting on $\mathcal{X}$. Regarding an
arbitrary matrix $\mathbf{X}$ as a representation of some graph $X$, the orbit $[\mathbf{X}]$
consists of all possible matrices that can represent $X$. By identifying the orbits
of $\mathcal{X}_\Gamma$ with attributed graphs, the set $\mathcal{G}_{\mathbb{A}}$ of attributed graphs of bounded order
$n$ is a Riemannian orbifold.
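The identification of a graph with the orbit of its representation matrices can be checked directly for small examples. The sketch below is only an illustration: it assumes real-valued attributes (a one-dimensional feature space), and the function names `relabel` and `orbit` are hypothetical. A permutation `p` plays the role of the permutation matrix $P$, and `relabel` applies the simultaneous row and column reordering $P^T \mathbf{X} P$.

```python
from itertools import permutations

def relabel(M, p):
    """Reorder rows and columns of M by the same vertex permutation p."""
    n = len(M)
    return tuple(tuple(M[p[i]][p[j]] for j in range(n)) for i in range(n))

def orbit(M):
    """All representation matrices of the graph represented by M."""
    return {relabel(M, p) for p in permutations(range(len(M)))}

path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]        # centre vertex listed second
relabeled = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]   # centre vertex listed last
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

print(orbit(path) == orbit(relabeled))   # True: one graph, two representations
print(len(orbit(path)))                  # 3: one matrix per position of the centre
print(len(orbit(triangle)))              # 1: every relabeling fixes the matrix
```

The triangle matrix is fixed by every element of the group, which makes it a singular point in the sense of Section 3.1.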

3.3 Metric Structures 

Let $\mathcal{Q} = (\mathcal{X}, \Gamma, \pi)$ be an orbifold. We derive an intrinsic metric that enables us
to do Riemannian geometry. In the case of a Riemannian orbifold of attributed
graphs the intrinsic metric coincides with the graph metric of (2) induced by an
optimal alignment kernel.

Any inner product $\langle \cdot, \cdot \rangle$ on $\mathcal{X}$ gives rise to a maximizer of the form

$$k : \mathcal{X}_\Gamma \times \mathcal{X}_\Gamma \to \mathbb{R}, \quad (X, Y) \mapsto \max \{\langle x, y \rangle : x \in X, y \in Y\}.$$

We call the kernel function $k(\cdot|\cdot)$ the optimal alignment kernel induced by the inner
product $\langle \cdot, \cdot \rangle$. Note that the maximizer of a set of positive definite kernels is an
indefinite kernel in general. Since $\Gamma$ is a group, we find that

$$k(X,Y) = \max \{\langle x, y \rangle : x \in X\},$$

where $y$ is an arbitrary but fixed vector representation of $Y$. In general, we have

$$k(X,Y) \geq \langle x, y \rangle$$

for all $x \in X$ and $y \in Y$.

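Example 2. Consider the Riemannian orbifold $\mathcal{Q} = (\mathcal{X}, \Gamma, \pi)$ of Example 1, where
$\mathcal{X} = \mathbb{R}^2$ and $\Gamma = \{\mathrm{id}, \gamma\}$ is the group generated by the reflection across the
main diagonal of the x-y-plane. Suppose that $x = (1, 2)$ is a vector representation of $X$
and $y = (3, 2)$ is a vector representation of $Y$. Then the optimal alignment kernel $k(X, Y)$
induced by the standard inner product of $\mathcal{X}$ is given by

$$k(X,Y) = \max \{\langle x, y \rangle, \langle \gamma(x), y \rangle, \langle x, \gamma(y) \rangle, \langle \gamma(x), \gamma(y) \rangle\}.$$

Evaluating the inner products yields

$$\begin{aligned}
\langle x, y \rangle &= \langle (1,2), (3,2) \rangle = 7 \\
\langle \gamma(x), y \rangle &= \langle (2,1), (3,2) \rangle = 8 \\
\langle x, \gamma(y) \rangle &= \langle (1,2), (2,3) \rangle = 8 \\
\langle \gamma(x), \gamma(y) \rangle &= \langle (2,1), (2,3) \rangle = 7.
\end{aligned}$$

Thus, we have $k(X, Y) = 8$.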

Example 3. Suppose that $X$ and $Y$ are attributed graphs where edges have
attribute 1 and vertices have attribute 0. The optimal alignment kernel $k(X, Y)$
induced by the standard inner product of $\mathcal{X}$ is the number of edges of a maximum
common subgraph of $X$ and $Y$.

Example 4. More generally, if property P1 is satisfied, then any optimal alignment
kernel on a bounded set of attributed graphs as defined in (1) is also an
optimal alignment kernel of some Riemannian orbifold.

Suppose that $X \in \mathcal{X}_\Gamma$. Since $k(X,X) = \langle x, x \rangle$ for all $x \in X$, we can define
the length of $X$ by

$$l(X) = \sqrt{k(X,X)}.$$

The optimal alignment kernel together with the length satisfies the Cauchy-Schwarz
inequality

$$|k(X,Y)| \leq l(X) \cdot l(Y).$$

Since the Cauchy-Schwarz inequality is valid, the geometric interpretation of
$k(\cdot|\cdot)$ is that it computes the cosine of a well-defined angle between $X$ and $Y$
provided they are normalized to length 1.

Likewise, $k(\cdot|\cdot)$ gives rise to a distance function defined by

$$d(X,Y) = \sqrt{l(X)^2 - 2k(X,Y) + l(Y)^2}.$$

From the definition of $k(\cdot|\cdot)$ it follows that $d$ is a metric. In addition, we have

$$d(X,Y) = \min \{\|x - y\| : x \in X, y \in Y\}, \qquad (3)$$

where $\|\cdot\|$ denotes the Euclidean norm induced by the inner product $\langle \cdot, \cdot \rangle$ of the
Euclidean space $\mathcal{X}$.

Example 5. Consider the Riemannian orbifold $\mathcal{Q} = (\mathcal{X}, \Gamma, \pi)$ of Examples 1 and 2.
Suppose that $x = (1,2)$ is a vector representation of $X$ and $y = (3,2)$ is a
vector representation of $Y$. Then the squared lengths of $X$ and $Y$ are $l(X)^2 = 5$
and $l(Y)^2 = 13$. Since $k(X,Y) = 8$ according to Example 2, the distance is
$d(X, Y) = \sqrt{5 - 16 + 13} = \sqrt{2}$.
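The numbers in Examples 2 and 5 can be reproduced with a few lines of Python. The sketch below is only a check of the worked example: it hard-codes the two-element group of Example 1 (identity and coordinate swap), and the function names are hypothetical. It evaluates the distance twice, once through the kernel and the lengths and once as the minimum in (3).

```python
from math import sqrt, isclose

def swap(v):
    """The non-trivial group element: reflection across the main diagonal."""
    return (v[1], v[0])

def inner(u, v):
    """Standard inner product on the plane."""
    return u[0] * v[0] + u[1] * v[1]

def orbit(v):
    """All vector representations of the structure represented by v."""
    return [v, swap(v)]

def kernel(x, y):
    """Optimal alignment kernel: maximal inner product over both orbits."""
    return max(inner(a, b) for a in orbit(x) for b in orbit(y))

def distance_via_kernel(x, y):
    """Distance computed from the kernel and the squared lengths."""
    return sqrt(kernel(x, x) - 2 * kernel(x, y) + kernel(y, y))

def distance_via_min(x, y):
    """Distance computed as the minimal Euclidean distance, as in (3)."""
    def norm(a, b):
        diff = (a[0] - b[0], a[1] - b[1])
        return sqrt(inner(diff, diff))
    return min(norm(a, b) for a in orbit(x) for b in orbit(y))

x, y = (1, 2), (3, 2)
print(kernel(x, y))                                   # 8
print(kernel(x, x), kernel(y, y))                     # 5 13
print(isclose(distance_via_kernel(x, y), sqrt(2)))    # True
print(isclose(distance_via_min(x, y), sqrt(2)))       # True
```

Both routes agree because the group elements are isometries, so maximizing the inner product and minimizing the Euclidean distance select the same pair of representations.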



Example 6. If properties P1 and P2 are satisfied, then the graph metric (2)
coincides with the intrinsic orbifold metric (3).

Equation (3) states that $d(\cdot|\cdot)$ is the length of a minimizing geodesic between $X$
and $Y$ and therefore an intrinsic metric, because it coincides with the infimum
of the length of all admissible curves from $X$ to $Y$. In addition, we find that the
topology of $\mathcal{X}_\Gamma$ induced by the metric $d$ coincides with the quotient topology
induced by the topology of the Euclidean space $\mathcal{X}$.

3.4 Orbifold Functions 

Suppose that $\mathcal{Q} = (\mathcal{X}, \Gamma, \pi)$ is an orbifold. An orbifold function is a mapping

$$f : \mathcal{X}_\Gamma \to \mathbb{R}.$$

The lift of $f$ is a function

$$\hat{f} : \mathcal{X} \to \mathbb{R}$$

satisfying $\hat{f} = f \circ \pi$. The lift $\hat{f}$ is invariant under group actions of $\Gamma$, that is
$\hat{f}(x) = \hat{f}(\gamma(x))$ for all $\gamma \in \Gamma$.

We say an orbifold function $f : \mathcal{X}_\Gamma \to \mathbb{R}$ is continuous (locally Lipschitz,
differentiable, generalized differentiable) at $X \in \mathcal{X}_\Gamma$ if its lift $\hat{f}$ is continuous
(locally Lipschitz, differentiable, generalized differentiable) at some vector
representation $x \in X$. The definition is independent of the choice of the vector
representation that projects to $X$ (see Section B.1, Prop. 1 - Prop. 4). For a
definition of generalized differentiable functions and their basic properties we
refer to Section A.

Example 7. Consider the Riemannian orbifold $\mathcal{Q} = (\mathcal{X}, \Gamma, \pi)$ of Examples 1, 2, and 5. The
function

$$f_Y : \mathcal{X}_\Gamma \to \mathbb{R}, \quad X \mapsto k(X,Y)$$

for some $Y \in \mathcal{X}_\Gamma$ is an orbifold function with lift

$$\hat{f}_Y : \mathcal{X} \to \mathbb{R}, \quad x \mapsto \max \{\langle x, y \rangle, \langle x, \gamma(y) \rangle\},$$

where $y \in Y$. Analytical properties of $f$ such as continuity and differentiability
can be investigated using the lift $\hat{f}$ of $f$. For example, if $\hat{f}$ is differentiable
at $x \in X$ then it is also differentiable at $\gamma(x)$ according to Prop. 3. Hence,
differentiability of the orbifold function $f$ is well-defined at $X$.

3.5 Gradients and Generalized Gradients of Orbifold Functions 

We extend the notion of gradient and generalized gradient to differentiable and 
generalized differentiable orbifold functions. 



Gradient of Differentiable Orbifold Functions. Suppose that $f : \mathcal{X}_\Gamma \to \mathbb{R}$ is
differentiable at $X \in \mathcal{X}_\Gamma$. Then its lift $\hat{f} : \mathcal{X} \to \mathbb{R}$ is differentiable at all vector
representations that project to $X$. The gradient $\nabla f(X)$ of $f$ at $X$ is defined by
the projection

$$\nabla f(X) = \pi\big(\nabla \hat{f}(x)\big)$$

of the gradient $\nabla \hat{f}(x)$ of $\hat{f}$ at a vector representation $x \in X$. This definition is
independent of the choice of the vector representation. We have

$$\nabla \hat{f}(\gamma(x)) = \gamma\big(\nabla \hat{f}(x)\big)$$

for all $\gamma \in \Gamma$. This implies that the gradients of $\hat{f}$ at $x$ and $\gamma(x)$ are vector
representations of the same structure, namely the gradient $\nabla f(X)$ of the orbifold
function $f$ at $X$. Thus, the gradient of $f$ at $X$ is a well-defined structure pointing
in the direction of steepest ascent (see Section B.1, Prop. 3).
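The invariance of a lift and the transformation rule for its gradient can be illustrated with the kernel function of Example 7. The sketch below assumes the two-element group of Example 1; the names `lift` and `lift_gradient` are hypothetical, and the gradient is obtained from the observation that a maximum of two linear functions has the gradient of its active branch wherever the two branches differ.

```python
def swap(v):
    """The non-trivial group element: reflection across the main diagonal."""
    return (v[1], v[0])

def inner(u, v):
    """Standard inner product on the plane."""
    return u[0] * v[0] + u[1] * v[1]

def lift(x, y):
    """Lift of X -> k(X, Y), evaluated at the vector representation x."""
    return max(inner(x, y), inner(x, swap(y)))

def lift_gradient(x, y):
    """Gradient of the lift at x, defined where the maximum is attained uniquely."""
    a, b = inner(x, y), inner(x, swap(y))
    if a == b and y != swap(y):
        raise ValueError("the lift is not differentiable at this point")
    return y if a >= b else swap(y)

x, y = (1, 2), (3, 2)
g = lift_gradient(x, y)
g_reflected = lift_gradient(swap(x), y)

print(lift(x, y), lift(swap(x), y))   # 8 8: the lift is invariant
print(g, g_reflected)                 # (2, 3) (3, 2)
print(g_reflected == swap(g))         # True: both represent the same structure
```

The two gradients differ as vectors but lie in the same orbit, so they project to one gradient structure on the orbifold.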

Subdifferential of Generalized Differentiable Orbifold Functions. Suppose that
$f : \mathcal{X}_\Gamma \to \mathbb{R}$ is generalized differentiable at $X \in \mathcal{X}_\Gamma$. Then its lift $\hat{f} : \mathcal{X} \to \mathbb{R}$
is generalized differentiable at all vector representations that project to $X$. The
subdifferential $\partial f(X)$ of $f$ at $X$ is defined by the projection

$$\partial f(X) = \pi\big(\partial \hat{f}(x)\big)$$

of the subdifferential $\partial \hat{f}(x)$ of $\hat{f}$ at a vector representation $x \in X$. This definition
is independent of the choice of the vector representation. We have

$$\partial \hat{f}(\gamma(x)) = \gamma\big(\partial \hat{f}(x)\big)$$

for all $\gamma \in \Gamma$. This implies that the subdifferentials $\partial \hat{f}(x) \subseteq \mathcal{X}$ and
$\partial \hat{f}(\gamma(x)) \subseteq \mathcal{X}$ are subsets that project to the same subset of $\mathcal{X}_\Gamma$, namely the
subdifferential $\partial f(X)$ (see Section B.1, Prop. 4).

The properties of generalized differentiable functions as listed in Section A
carry over to generalized differentiable orbifold functions via their lifts. For
example, a generalized differentiable orbifold function is locally Lipschitz and
therefore differentiable almost everywhere.

Example 8. Let $(\mathcal{G}_{\mathbb{A}}, d)$ be a graph space, where

$$d(X,Y) = \min_{\phi \in \mathcal{A}(X,Y)} d_\phi(X,Y)$$

is a graph edit distance. We can identify $\mathcal{G}_{\mathbb{A}}$ with a Riemannian orbifold
$\mathcal{Q} = (\mathcal{X}, \Gamma, \pi)$ and the graph edit distance $d(\cdot|\cdot)$ with a distance function defined on
$\mathcal{X}_\Gamma$. Suppose that the cost functions $d_\phi(\cdot|\cdot)$ of the edit paths are continuously
differentiable (generalized differentiable). Then the distance $d(\cdot|\cdot)$ is generalized
differentiable.

Example 9. Let $\mathcal{Q}$ be a Riemannian orbifold of attributed graphs. Then (i) an
optimal alignment kernel $k(\cdot|\cdot)$, (ii) the intrinsic metric $d(\cdot|\cdot)$ induced by $k(\cdot|\cdot)$,
and (iii) the squared metric $d(\cdot|\cdot)^2$ are generalized differentiable.



3.6 Integration on Orbifolds 



Suppose that $\mathcal{Q} = (\mathcal{X}, \Gamma, \pi)$ is a Riemannian orbifold with singular set $S_{\mathcal{Q}}$. In
order to integrate orbifold functions $f : \mathcal{X}_\Gamma \to \mathbb{R}$ by the Lebesgue integral, we
need to construct an appropriate measurable space together with an orbifold
measure. The measurable space is defined by the Borel set $\mathcal{B}(\mathcal{X}_\Gamma)$ generated by
the open sets of $\mathcal{X}_\Gamma$. From the orbifold measure we expect that it is compatible
with the local Riemannian measures. In addition, we demand that the singular
set $S_{\mathcal{Q}}$ has measure 0. This is motivated by the following fact: The singular
set is covered locally by the finite union of totally geodesic submanifolds, which
has measure 0 relative to the local canonical Riemannian measure. Since the
projection to the orbifold is distance decreasing, it is reasonable to ask for an
orbifold measure that assigns measure 0 to the singular set $S_{\mathcal{Q}}$.

Let $\mathcal{B}(\mathcal{X}_\Gamma \setminus S_{\mathcal{Q}})$ denote the Borel set generated by the open sets of
$\mathcal{X}_\Gamma \setminus S_{\mathcal{Q}}$. Then there exists a complete canonical measure $\mu$ on the Borel set
$\mathcal{B}(\mathcal{X}_\Gamma \setminus S_{\mathcal{Q}})$ given by a unique volume form on $\mathcal{X}_\Gamma \setminus S_{\mathcal{Q}}$. The measure can be
extended to a complete measure $\nu$ on the Borel set $\mathcal{B}(\mathcal{X}_\Gamma)$ such that

$$\nu(A) = \mu(A \setminus S_{\mathcal{Q}}) = \int_{A \setminus S_{\mathcal{Q}}} d\mu.$$

In particular, we have $\nu(A) = 0$ for any subset $A \subseteq S_{\mathcal{Q}}$. For proofs we refer to
[4].

In the following we write

$$\int_{U_\Gamma} f(X) \, dX = \int_{U_\Gamma} f \, d\nu$$

for the integral of an orbifold function $f : U_\Gamma \to \mathbb{R}$ defined on a measurable
subset $U_\Gamma \subseteq \mathcal{X}_\Gamma$. We tacitly assume that all integrals occurring in the following
sections exist.

4 Graph Quantization 

This section extends vector quantization to quantization of graphs.

4.1 The Basics

Suppose that $\mathcal{Q} = (\mathcal{X}, \Gamma, \pi)$ is a Riemannian orbifold. A graph quantizer of size
$k$ is a mapping of the form

$$Q : \mathcal{X}_\Gamma \to C,$$

where $C = \{Y_1, \ldots, Y_k\} \subseteq \mathcal{X}_\Gamma$ is a finite set, called codebook. The elements $Y_j \in C$
are the code graphs. The graph quantizer $Q$ partitions the input space $\mathcal{X}_\Gamma$ into
$k$ disjoint regions

$$\mathcal{R}_j = \{X \in \mathcal{X}_\Gamma : Q(X) = Y_j\}$$

such that their union covers $\mathcal{X}_\Gamma$. By $\mathcal{P}_Q$ we denote the partition of $Q$ consisting
of all $k$ regions $\mathcal{R}_j$.

Suppose that $J = \{1, \ldots, k\}$. The basic operation of a graph quantizer $Q$
can be written as a composition $Q = d_Q \circ e_Q$ of an encoder $e_Q : \mathcal{X}_\Gamma \to J$ and a
decoder $d_Q : J \to C$. The encoder assigns each input graph to a region via the
index set $J$. The decoder maps indices of $J$ referring to regions to code graphs.

4.2 Graph Quantizer Performance 

We measure the performance of a graph quantizer $Q$ by the expected distortion

$$D(Q) = E_X[d(X, Q(X))] = \int_{\mathcal{X}_\Gamma} d(X, Q(X)) \, dP(X),$$

where $X \in \mathcal{X}_\Gamma$ is a random variable with probability measure $P = P_{\mathcal{X}_\Gamma}$
representing the observable graphs to be quantized. The expectation $E_X$ is taken
with respect to some probability space $(\mathcal{X}_\Gamma, \Sigma_{\mathcal{X}_\Gamma}, P_{\mathcal{X}_\Gamma})$. The quantity $d(X,Y)$
measures the distortion of the random input graph $X$ and code graph $Y$. Here
we consider graph distortion measures that are graph edit distances. An example
is the squared metric induced by an optimal alignment kernel

$$d(X,Y) = \min_{x \in X, y \in Y} \|x - y\|^2.$$

Using the codebook and partition for the given quantizer $Q$, we can rewrite the
expected distortion by

$$D(Q) = \sum_{j=1}^{k} \int_{\mathcal{R}_j} d(X, Y_j) \, dP(X).$$

4.3 The Problem of Optimal Graph Quantizer Design

The problem of optimal graph quantizer design is stated as follows: Find a
codebook $C$ specifying the decoder $d_Q$ and a partition $\mathcal{P}_Q$ specifying the encoder
$e_Q$ such that the expected distortion $D(Q)$ is minimized. The composite mapping
$Q = d_Q \circ e_Q$ of the resulting encoder and decoder is then an optimal graph
quantizer.

An optimal graph quantizer satisfies the following necessary conditions, also
known as the Lloyd-Max conditions:

1. Nearest Neighbor Condition. Given a fixed codebook $C$, a graph quantizer
$Q$ is optimal if the code graph $Q(X)$ of an input pattern $X$ satisfies the
nearest neighbor rule

$$Q(X) = \arg\min_{Y \in C} d(X, Y)$$

for all $X \in \mathcal{X}_\Gamma$, where ties are resolved according to some rule. A proof is
given in Section B.2, Theorem 3.



2. Centroid Condition. Given a fixed partition $\mathcal{P}_Q$, a graph quantizer $Q$ is
optimal if each code graph $Y_j$ is a centroid of region $\mathcal{R}_j$, that is

$$Y_j = \arg\min_{Y \in \mathcal{X}_\Gamma} E[d(X,Y) \mid X \in \mathcal{R}_j]$$

for all $j \in J$. A proof is given in Section B.2, Theorem 4.

Note that any $Y_j$ with

$$Y_j = \arg\min_{Y \in \mathcal{X}_\Gamma} E[d(X,Y) \mid X \in \mathcal{R}_j]$$

is called a centroid of region $\mathcal{R}_j$. The centroids may not be unique. This also
holds for squared metrics induced by some optimal alignment kernel, which are
the counterparts of squared Euclidean distances.
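The encoder, the decoder, and the nearest neighbor rule can be sketched for small graphs as follows. This is an illustration under assumptions that go beyond the text: graphs have the same order and real-valued attributes, the distortion is the intrinsic metric (3) evaluated by enumerating all relabelings, ties are resolved in favor of the lowest index, and the names `encode`, `decode`, and `quantize` are hypothetical.

```python
from itertools import permutations
from math import sqrt

def relabel(M, p):
    """Reorder rows and columns of M by the same vertex permutation p."""
    n = len(M)
    return [[M[p[i]][p[j]] for j in range(n)] for i in range(n)]

def distance(X, Y):
    """Metric (3): smallest Euclidean distance between representations."""
    n = len(X)
    return min(
        sqrt(sum((X[i][j] - Yp[i][j]) ** 2 for i in range(n) for j in range(n)))
        for Yp in (relabel(Y, p) for p in permutations(range(n)))
    )

def encode(X, codebook):
    """Encoder: index of the nearest code graph (lowest index wins ties)."""
    return min(range(len(codebook)), key=lambda j: distance(X, codebook[j]))

def decode(j, codebook):
    """Decoder: code graph that belongs to index j."""
    return codebook[j]

def quantize(X, codebook):
    """Quantizer as the composition of decoder and encoder."""
    return decode(encode(X, codebook), codebook)

path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
codebook = [path, triangle]

# A path with perturbed edge weights whose centre vertex is listed last.
noisy = [[0, 0, 0.9], [0, 0, 1.1], [0.9, 1.1, 0]]
print(encode(noisy, codebook))              # 0
print(quantize(noisy, codebook) == path)    # True
```

Since the group acts by isometries, it suffices to relabel only one of the two graphs when searching for the minimum in `distance`.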

4.4 Graph Quantizer Design 

Since the distribution $P = P_{\mathcal{X}_\Gamma}$ of the observable graphs is usually unknown, the
expected distortion $D(C)$ can neither be computed nor be minimized directly.
Instead, we design (estimate) an optimal quantizer from empirical data. For
vectors, prominent methods for designing an optimal quantizer are k-means and
simple competitive learning. Both methods have been extended for designing
graph quantizers in the context of
prototype-based clustering. To derive consistency results for k-means and simple
competitive learning in the domain of graphs, we consider estimators based on
empirical distortions and on stochastic approximation.

Estimators based on Empirical Distortion Measures. In order to derive
consistency results, we restrict the set of feasible codebooks to a compact
subspace

$$W \subseteq \underbrace{\mathcal{X}_\Gamma \times \cdots \times \mathcal{X}_\Gamma}_{k\text{-times}}$$

of the topological space $\mathcal{X}_\Gamma^k$. The problem of designing an optimal quantizer for
graphs is then of the form

$$\min_{C \in W} D(C) = \sum_{j=1}^{k} \int_{\mathcal{R}_j} d(X, Y_j) \, dP(X),$$

where the minimum is taken over the compact set $W$ rather than $\mathcal{X}_\Gamma^k$. Let

1. $D^*$ be the minimal value of the expected distortion $D(C)$,

2. $W^* = \{C \in W : D(C) = D^*\}$ be the set of true (optimal) codebooks, and

3. $W^*_\epsilon = \{C \in W : D(C) \leq D^* + \epsilon\}$ be the set of approximate solutions.



To design an optimal graph quantizer, we minimize the empirical distortion

$$D_N(C) = \frac{1}{N} \sum_{i=1}^{N} \min_{1 \leq j \leq k} d(X_i, Y_j),$$

where $C \in W$ and $S = \{X_1, \ldots, X_N\}$ is a training set consisting of $N$ independent
graphs $X_i$ drawn from $\mathcal{X}_\Gamma$. Let

1. $D_N^*$ be the minimal value of the empirical distortion $D_N(C)$,

2. $W_N^* = \{C \in W : D_N(C) = D_N^*\}$ be the set of empirical codebooks, and

3. $W_{N,\epsilon}^* = \{C \in W : D_N(C) \leq D_N^* + \epsilon\}$ be the set of approximate solutions.

The next result shows that estimators based on empirical distortions are consis- 
tent estimators. 

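Theorem 1. Suppose that $\mathcal{Q} = (\mathcal{X}, \Gamma, \pi)$ is a Riemannian orbifold, $d(X,Y)$ is
a locally Lipschitz metric on $\mathcal{X}_\Gamma$ with integrable Lipschitz constant, and $W \subseteq \mathcal{X}_\Gamma^k$
is compact. Then we have

$$\lim_{N \to \infty} D_N^*(\omega) = D^*$$
$$\lim_{N \to \infty} W_N^*(\omega) = W^*$$
$$\lim_{N \to \infty} W_{N,\epsilon}^*(\omega) = W_\epsilon^*$$

almost surely.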

The proof follows from [8] applied to the lift $\hat{d}$ of distortion $d$. Examples of locally
Lipschitz distance metrics on $\mathcal{X}_\Gamma$ with integrable Lipschitz constants are metrics
induced by an optimal alignment kernel

$$d(X,Y) = \min_{x \in X, y \in Y} \|x - y\|$$

as well as $d(X,Y)^2$.



K-Means. In order to extend the standard k-means method to graphs for constructing
an empirical codebook, we use the following update rule

$$y_j^{t+1} = \frac{1}{N_j^t} \sum_{i=1}^{N} q_{ij}^t \, x_i,$$

where $t \geq 0$ is the iteration, $x_i \in X_i$ and $y_j^t \in Y_j^t$ are vector representations
that are optimally aligned, and $Q^t = (q_{ij}^t)$ is the matrix representation of the
nearest neighbor quantizer $Q^t$ restricted to the training set $S$. Recall that two
vector representations $x \in X$ and $y \in Y$ are optimally aligned if
$\|x - y\| = d(X,Y)$. The elements of $Q^t$ are of the form

$$q_{ij}^t = \begin{cases} 1 & : Q^t(X_i) = Y_j^t \\ 0 & : \text{otherwise.} \end{cases}$$

The quantity $N_j^t$ denotes the number of elements from the training set that are
quantized by code graph $Y_j^t$.
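One iteration of this update can be sketched in Python for small graphs. The sketch makes several assumptions that are not stated in the text: all graphs have the same order and real-valued attributes, the distortion is the squared metric, optimal alignments are found by enumerating all relabelings, and a code graph without assigned training graphs is left unchanged. The function names are hypothetical.

```python
from itertools import permutations

def relabel(M, p):
    """Reorder rows and columns of M by the same vertex permutation p."""
    n = len(M)
    return [[M[p[i]][p[j]] for j in range(n)] for i in range(n)]

def sq_dist(A, B):
    """Squared Euclidean distance between two representation matrices."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def align(X, y):
    """Representation of X optimally aligned to y, and the squared distance."""
    best = min((relabel(X, p) for p in permutations(range(len(X)))),
               key=lambda x: sq_dist(x, y))
    return best, sq_dist(best, y)

def kmeans_step(graphs, codebook):
    """Nearest neighbor assignment, then the mean of the aligned members."""
    n = len(codebook[0])
    members = [[] for _ in codebook]
    for X in graphs:
        aligned = [align(X, y) for y in codebook]
        c = min(range(len(codebook)), key=lambda j: aligned[j][1])
        members[c].append(aligned[c][0])
    new_codebook = []
    for y, group in zip(codebook, members):
        if not group:
            new_codebook.append(y)  # assumption: keep unused code graphs
            continue
        new_codebook.append([[sum(x[a][b] for x in group) / len(group)
                              for b in range(n)] for a in range(n)])
    return new_codebook

def empirical_distortion(graphs, codebook):
    """Mean squared distance of each graph to its closest code graph."""
    return sum(min(align(X, y)[1] for y in codebook) for X in graphs) / len(graphs)

p1 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # path with edge weight 1
p2 = [[0, 0, 2], [0, 0, 2], [2, 2, 0]]   # relabeled path with edge weight 2
t1 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]   # triangle with edge weight 1
t2 = [[0, 3, 3], [3, 0, 3], [3, 3, 0]]   # triangle with edge weight 3
graphs, codebook = [p1, p2, t1, t2], [p1, t1]

updated = kmeans_step(graphs, codebook)
print(empirical_distortion(graphs, codebook))   # 7.0
print(empirical_distortion(graphs, updated))    # 2.75
```

For the squared metric the empirical distortion cannot increase in such a step, because the mean minimizes the summed squared distance to the aligned members and a later realignment can only shorten distances further.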

As for vectors, a drawback of k-means for graphs is that it is a local optimiza- 
tion technique for which existing consistency theorems are inapplicable, because 
Theorem 1 assumes global instead of local minimizers of the empirical distortion 
as estimators. 

Estimators based on Stochastic Optimization. Suppose that $W = \mathcal{X}_\Gamma^k$.
Stochastic optimization methods directly minimize the expected distortion

$$D(C) = \sum_{j=1}^{k} \int_{\mathcal{R}_j} d(X, Y_j) \, dP(X) = \int_{\mathcal{X}_\Gamma} \min_{1 \leq j \leq k} d(X, Y_j) \, dP(X),$$

using a training set $S = \{X_1, \ldots, X_N\}$ of $N$ independent graphs $X_i$ drawn from
$\mathcal{X}_\Gamma$. We assume that the loss function

$$L(X,C) = \min_{1 \leq j \leq k} d(X, Y_j)$$

is generalized differentiable, hence $L(X,C)$ is differentiable almost everywhere.

Example 10. If the graph distortion $d(\cdot|\cdot)$ is generalized differentiable, then the
loss function $L(X,C)$ is also generalized differentiable by calculus of generalized
differentiable functions. This holds for the graph distortions of Examples 8 and 9.

Since the interchange of integral and generalized gradient remains valid for
generalized differentiable loss functions, that is

$$\partial D(C) = E_X[\partial L(X,C)]$$

under mild assumptions (see [11,27]), we can minimize the expected distortion
$D(C)$ according to the following stochastic generalized gradient (SGG) method:

$$y_{t+1} = y_t + \eta_t (x_t - y_t), \qquad (4)$$

where $x_t$ is a vector representation of input pattern $X_t \in S$, which is optimally
aligned to vector representation $y_t$ of a code graph $Y_t$ closest to $X_t$. The
random elements $s_t = x_t - y_t \in S_t$ are vector representations of stochastic
generalized gradients $S_t$, i.e. random variables defined on the probability space
$(\mathcal{X}_\Gamma, \Sigma_{\mathcal{X}_\Gamma}, P_{\mathcal{X}_\Gamma})^\infty$ such that

$$E[S_t \mid C_0, \ldots, C_t] \in \partial D(C_t). \qquad (5)$$



We consider the following conditions for almost sure convergence of stochastic
optimization:

A1 The sequence $(\eta_t)_{t \geq 0}$ of step sizes satisfies

$$\eta_t > 0, \quad \lim_{t \to \infty} \eta_t = 0, \quad \sum_{t=1}^{\infty} \eta_t = \infty, \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty.$$

A2 The stochastic generalized gradients $(S_t)_{t \geq 0}$ satisfy (5).
A3 The expected squared norm of stochastic generalized gradients
$(S_t)_{t \geq 0}$ is bounded by

$$E\left[\|S_t\|^2\right] < +\infty.$$

The next result shows that the SGG method is a consistent estimator. 

Theorem 2. Let $\mathcal{Q} = (\mathcal{X}, \Gamma, \pi)$ be a Riemannian
orbifold and let $d(X,Y)$ be a generalized differentiable metric on
$\mathcal{X}_\Gamma$. Suppose that assumptions (A1)-(A3) hold. Then the sequence
$(C_t)_{t \geq 0}$ generated by the SGG method converges almost surely to graphs
satisfying necessary extremum conditions

$$\mathcal{W}^* = \{C \in \mathcal{W} : 0 \in \partial D(C)\}.$$

Besides, the sequence $(D(C_t))_{t \geq 0}$ converges almost surely and we have

$$\lim_{t \to \infty} D(C_t) \in D(\mathcal{W}^*).$$

The proof is a direct consequence of Ermoliev and Norkin's Theorem [11] applied
to the lift $\hat{d}(\cdot,\cdot)$ of $d(\cdot,\cdot)$.
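
As an illustration of update rule (4), the following minimal sketch runs the
SGG iteration on small weighted adjacency matrices. It assumes brute-force
alignment over all vertex permutations, repeated passes over the training set
in random order instead of fresh independent draws, and the step size
$\eta_t = 1/(t+1)$, which is one choice satisfying (A1). The function names and
these choices are assumptions made for the example and are not prescribed by
the text.

```python
import itertools
import numpy as np

def align(x, y):
    """Return (x_aligned, distance): the vertex relabeling of x closest to y."""
    n = x.shape[0]
    best, best_d = x, np.inf
    for perm in itertools.permutations(range(n)):
        idx = list(perm)
        xp = x[np.ix_(idx, idx)]      # relabel the vertices of x
        d = np.linalg.norm(xp - y)    # Frobenius norm of the difference
        if d < best_d:
            best, best_d = xp, d
    return best, best_d

def sgg(graphs, codebook, epochs=5, seed=0):
    """Stochastic generalized gradient sketch: y <- y + eta_t * (x - y)."""
    rng = np.random.default_rng(seed)
    codebook = [np.array(y, dtype=float) for y in codebook]  # work on copies
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(graphs)):
            x = graphs[i]
            aligned = [align(x, y) for y in codebook]
            j = int(np.argmin([d for _, d in aligned]))  # closest code graph
            eta = 1.0 / (t + 1)       # positive, -> 0, sum = inf, sum of squares < inf
            codebook[j] += eta * (aligned[j][0] - codebook[j])
            t += 1
    return codebook

a = np.array([[0., 1., 0.], [1., 0., 2.], [0., 2., 0.]])
p = [2, 0, 1]
b = a[np.ix_(p, p)]                   # the same graph with relabeled vertices
result = sgg([a, b], [a + 1.0])
```

Because every training graph here is a relabeling of `a`, the single code graph
moves to `a` and stays there. With several distinct graphs and code graphs the
result depends on the sampling order and on the initial codebook.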



5 Remarks to GQ using the Graph Edit Distance 

In many applications, the graph edit distance is discontinuous. Examples include 
edit distances with constant non-zero deletion and/or insertion cost. A necessary 
(but not sufficient) condition for the consistency results stated in Theorems 1
and 2 is that the underlying graph distortion is locally Lipschitz. Hence, both
consistency results are inapplicable for discontinuous graph distortions. Let us 
consider both cases separately. 

Estimators based on Empirical Distortion Measures. Estimators based on
empirical distortion measures aim at approximating the expected distortion
$D(C)$ by its empirical mean

$$\min_{C \in \mathcal{W}} D_N(C) = \frac{1}{N} \sum_{i=1}^{N} \min_{1 \leq j \leq k} d(X_i, Y_j).$$

As shown in [10], minimizing the empirical distortion is often meaningless if
the underlying graph edit distance function $d(\cdot,\cdot)$, and thus
$D_N(C)$, is discontinuous, even if the expectation $D(C)$ may be continuously
differentiable. Since the local solutions of $D_N(C)$ may have nothing in common
with the local solutions of the original problem, estimators based on the
empirical distortion $D_N(C)$ can be statistically inconsistent. Hence,
minimizing $D_N(C)$ with an underlying discontinuous graph edit distance using
global or local optimization techniques such as k-means lacks theoretical
support.



Estimators based on Stochastic Optimization. The situation is better for
estimators based on methods from stochastic optimization. For discontinuous
graph edit distances $d(\cdot,\cdot)$ the expected distortion can be minimized
in a statistically consistent way, for example, by methods based on
approximations of $d(\cdot,\cdot)$ via averaged functions obtained by
convolution with so-called mollifiers. For details, we refer to [9].

6 Conclusion 

This contribution proposes a theoretically sound foundation of graph
quantization, generalizing the ideas of vector quantization to the domain of
attributed graphs. We presented consistency results for graph quantizer design,
where the underlying graph edit distance is generalized differentiable. As for
vectors, estimators based on empirical distortion and stochastic optimization
are statistically consistent. If the underlying distortion measure is a
discontinuous graph edit distance, estimators based on empirical distortion
measures lack theoretical justification. Thus, the proposed consistency results
justify existing research on prototype-based clustering in the domain of graphs.
In addition, we showed that the Lloyd-Max conditions are necessary conditions
for optimality of GQ.

The mathematical framework that enables us to derive consistency results is
that of Riemannian orbifolds. Identifying graphs with points in a Riemannian
orbifold provides us with local access to a Euclidean space. This in turn allows
us to introduce geometrical and analytical concepts for extending vector
quantization to the domain of graphs. The implication of this approach is that
it provides us with a template for consistently linking methods from structural
pattern recognition other than GQ to statistical pattern recognition methods.

Acknowledgments. The first author is very grateful to Vladimir Norkin for 
his kind support and valuable comments. 

A Generalized Differentiable Functions

Let $\mathcal{X} = \mathbb{R}^n$ be a finite-dimensional Euclidean space. A
function $f : \mathcal{X} \to \mathbb{R}$ is generalized differentiable at
$x \in \mathcal{X}$ in the sense of Norkin [27] if there is a multi-valued map
$\partial f : \mathcal{X} \to 2^{\mathcal{X}}$ in a neighborhood of $x$ such that

1. $\partial f(x)$ is a convex and compact set;

2. $\partial f$ is upper semicontinuous at $x$, that is, if $y_i \to x$ and
$g_i \in \partial f(y_i)$ for each $i \in \mathbb{N}$, then each accumulation
point $g$ of $(g_i)$ is in $\partial f(x)$;

3. for each $y \in \mathcal{X}$ there is a $g \in \partial f(y)$ with
$f(y) = f(x) + \langle g, y - x \rangle + o(x, y, g)$, where

$$\lim_{i \to \infty} \frac{|o(x, y_i, g_i)|}{\|y_i - x\|} = 0$$

for all sequences $y_i \to x$ and $g_i \to g$ with $g_i \in \partial f(y_i)$.

We call $f$ generalized differentiable if it is generalized differentiable at
each point $x \in \mathcal{X}$. The set $\partial f(x)$ is the subdifferential
of $f$ at $x$ and its elements are called generalized gradients.

Generalized differentiable functions have the following properties [27]: 

(GD1) Generalized differentiable functions are locally Lipschitz and therefore
continuous and differentiable almost everywhere.

(GD2) Continuously differentiable, convex, and concave functions are
generalized differentiable.

(GD3) Suppose that $f_1, \ldots, f_n : \mathcal{X} \to \mathbb{R}$ are
generalized differentiable at $x \in \mathcal{X}$. Then

$$f_*(x) = \min(f_1(x), \ldots, f_n(x))$$
$$f^*(x) = \max(f_1(x), \ldots, f_n(x))$$

are generalized differentiable at $x \in \mathcal{X}$.

(GD4) Suppose that $f_1, \ldots, f_m : \mathcal{X} \to \mathbb{R}$ are
generalized differentiable at $x \in \mathcal{X}$ and
$f_0 : \mathbb{R}^m \to \mathbb{R}$ is generalized differentiable at
$y = (f_1(x), \ldots, f_m(x)) \in \mathbb{R}^m$. Then
$f(x) = f_0(f_1(x), \ldots, f_m(x))$ is generalized differentiable at
$x \in \mathcal{X}$. The subdifferential of $f$ at $x$ is of the form

$$\partial f(x) = \operatorname{con}\{ g \in \mathcal{X} : g = [g_1\, g_2 \cdots g_m]\, g_0,\; g_0 \in \partial f_0(y),\; g_i \in \partial f_i(x),\; 1 \leq i \leq m \},$$

where $[g_1\, g_2 \cdots g_m]$ is an $(n \times m)$-matrix.

(GD5) Suppose that $F(x) = E_z[f(x,z)]$, where $f(\cdot,z)$ is generalized
differentiable. Then $F$ is generalized differentiable and its subdifferential
at $x \in \mathcal{X}$ is of the form $\partial F(x) = E_z[\partial f(x,z)]$.

B Proofs 

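Suppose that $\mathcal{Q} = (\mathcal{X}, \Gamma, \pi)$ is a Riemannian
orbifold. By $\mathcal{U}_\delta(x) = \{x' : \|x - x'\| < \delta\}$ we denote the
open ball with center $x$ and radius $\delta > 0$. Note that
$\mathcal{U}_\delta(\gamma(x)) = \gamma(\mathcal{U}_\delta(x))$ for all
$\gamma \in \Gamma$.

B.1 Orbifold Functions

Continuous Orbifold Functions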

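Proposition 1. Let $f : \mathcal{X}_\Gamma \to \mathbb{R}$ be an orbifold
function. Suppose that its lift $\hat{f} : \mathcal{X} \to \mathbb{R}$ is
continuous at a vector representation $x$ that projects to
$X \in \mathcal{X}_\Gamma$. Then $\hat{f}$ is continuous at $\gamma(x)$ for all
$\gamma \in \Gamma$.

Proof. Let $\gamma \in \Gamma$ be a permutation and $x' = \gamma(x)$. Suppose
that $(y'_i)_{i \in \mathbb{N}}$ is a sequence with $y'_i \to x'$. Then there is
a sequence $(y_i)_{i \in \mathbb{N}}$ with $\gamma(y_i) = y'_i$ for each
$i \in \mathbb{N}$. Since permutations are homeomorphisms, we find that

$$\lim_{i \to \infty} y_i = \lim_{i \to \infty} \gamma^{-1}(y'_i) = \gamma^{-1}(x') = x.$$

From continuity of $\hat{f}$ at $x$ follows that $\hat{f}(y_i) \to \hat{f}(x)$.
Since $\hat{f}$ is invariant under group actions from $\Gamma$, we have
$\hat{f}(x) = \hat{f}(x')$ and $\hat{f}(y_i) = \hat{f}(y'_i)$ for each
$i \in \mathbb{N}$. We obtain

$$\lim_{i \to \infty} \hat{f}(y'_i) = \lim_{i \to \infty} \hat{f}(y_i) = \hat{f}(x) = \hat{f}(x').$$

This proves that $\hat{f}$ is continuous at each vector representation that
projects to $X$. □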



Locally Lipschitz Orbifold Functions 

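Proposition 2. Let $f : \mathcal{X}_\Gamma \to \mathbb{R}$ be an orbifold
function. Suppose that its lift $\hat{f} : \mathcal{X} \to \mathbb{R}$ is
locally Lipschitz at a vector representation $x$ that projects to
$X \in \mathcal{X}_\Gamma$. Then $\hat{f}$ is locally Lipschitz at $\gamma(x)$
for all $\gamma \in \Gamma$.

Proof. Since $\hat{f}$ is locally Lipschitz at $x$, there is an $L > 0$ and a
$\delta > 0$ such that

$$|\hat{f}(y) - \hat{f}(z)| \leq L \|y - z\|$$

for all $y, z \in \mathcal{U}_\delta(x)$. Let $\gamma \in \Gamma$ be a
permutation and $x' = \gamma(x)$. Since $\gamma$ is an isometric homeomorphism,
we have $\mathcal{U}_\delta(x') = \gamma(\mathcal{U}_\delta(x))$. From
$\Gamma$-invariance of $\hat{f}$ and the isometric property of $\gamma$ follows

$$|\hat{f}(y') - \hat{f}(z')| = |\hat{f}(y) - \hat{f}(z)| \leq L \|y - z\| = L \|y' - z'\|$$

for all $y', z' \in \mathcal{U}_\delta(x')$, where
$y = \gamma^{-1}(y') \in \mathcal{U}_\delta(x)$ and
$z = \gamma^{-1}(z') \in \mathcal{U}_\delta(x)$. This proves that $\hat{f}$ is
locally Lipschitz at each vector representation that projects to $X$. □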



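Differentiable Orbifold Functions

Proposition 3. Let $f : \mathcal{X}_\Gamma \to \mathbb{R}$ be an orbifold
function. Suppose that its lift $\hat{f} : \mathcal{X} \to \mathbb{R}$ is
differentiable at a vector representation $x$ that projects to
$X \in \mathcal{X}_\Gamma$. Then $\hat{f}$ is differentiable at $\gamma(x)$ for
all $\gamma \in \Gamma$. The gradient of $\hat{f}$ at $\gamma(x)$ is of the form

$$\nabla \hat{f}(\gamma(x)) = \gamma\left(\nabla \hat{f}(x)\right).$$

Proof. Since the lift $\hat{f}$ of $f$ is differentiable at $x$, there is a
$\delta > 0$ such that

$$\hat{f}(x + h) = \hat{f}(x) + \langle \nabla \hat{f}(x), h \rangle + o(h)$$

for all $h \in \mathcal{U}_\delta(0)$. Let $x'$ be an arbitrary vector
representation that projects to $X$. Then there is a $\gamma \in \Gamma$ with
$x' = \gamma(x)$. Since $\hat{f}$ is invariant under the group actions of
$\Gamma$, we have $\hat{f}(x') = \hat{f}(x)$. Then for each
$h' \in \mathcal{U}_\delta(0)$, we find that

$$\hat{f}(x' + h') - \hat{f}(x') = \hat{f}(x + h) - \hat{f}(x) = \langle \nabla \hat{f}(x), h \rangle + o(h),$$

where $h \in \mathcal{X}$ with $\gamma(h) = h'$. Since the elements of $\Gamma$
are isometries, we have $\|h\| = \|h'\|$, giving $h \in \mathcal{U}_\delta(0)$.
In addition, from isometry of $\gamma$ follows

$$\langle \nabla \hat{f}(x), h \rangle = \langle \gamma(\nabla \hat{f}(x)), \gamma(h) \rangle = \langle \gamma(\nabla \hat{f}(x)), h' \rangle.$$

We obtain

$$\hat{f}(x' + h') - \hat{f}(x') = \langle \gamma(\nabla \hat{f}(x)), h' \rangle + o'(h'),$$

where $o'(h') = o \circ \gamma^{-1}(h')$ satisfies

$$\lim_{h' \to 0} \frac{o'(h')}{\|h'\|} = \lim_{h' \to 0} \frac{o(\gamma^{-1}(h'))}{\|h'\|} = \lim_{h' \to 0} \frac{o(\gamma^{-1}(h'))}{\|\gamma^{-1}(h')\|} = 0.$$

This proves that $\hat{f}$ is differentiable at each vector representation that
projects to $X$. In addition, from the proof follows that the gradient of
$\hat{f}$ at $x' = \gamma(x)$ is of the form

$$\nabla \hat{f}(x') = \gamma\left(\nabla \hat{f}(x)\right).$$

□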
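
The gradient identity of Proposition 3 can be checked numerically on a toy
example. The sketch below is a simplification: it assumes that the group acts
on plain vectors by reordering coordinates, and it uses the hypothetical
permutation-invariant function `f_hat(x) = sum(x**3)` together with a central
finite-difference gradient. None of these choices come from the text.

```python
import numpy as np

def f_hat(x):
    """A permutation-invariant function (hypothetical example)."""
    return np.sum(x ** 3)

def num_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([0.5, -1.0, 2.0])
perm = [2, 0, 1]                       # gamma reorders the coordinates
lhs = num_grad(f_hat, x[perm])         # gradient at gamma(x)
rhs = num_grad(f_hat, x)[perm]         # gamma applied to the gradient at x
```

Up to finite-difference error, `lhs` and `rhs` agree, which is the identity
stated in the proposition for this toy setting.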



Generalized Differentiable Orbifold Functions 

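Proposition 4. Let $f : \mathcal{X}_\Gamma \to \mathbb{R}$ be an orbifold
function. Suppose that its lift $\hat{f} : \mathcal{X} \to \mathbb{R}$ is
generalized differentiable at a vector representation $x$ that projects to
$X \in \mathcal{X}_\Gamma$. Then $\hat{f}$ is generalized differentiable at
$\gamma(x)$ for all $\gamma \in \Gamma$ and

$$\partial \hat{f}(\gamma(x)) = \gamma\left(\partial \hat{f}(x)\right)$$

is a subdifferential of $\hat{f}$ at $\gamma(x)$ for all $\gamma \in \Gamma$.

Proof. Since $\hat{f}$ is generalized differentiable at $x$, there is a
multi-valued mapping
$\partial \hat{f} : \mathcal{U}_\delta(x) \to 2^{\mathcal{X}}$ defined on some
neighborhood $\mathcal{U}_\delta(x)$. Let $\gamma \in \Gamma$ be an arbitrary
permutation and $x' = \gamma(x)$. Then

$$\partial \hat{f} : \mathcal{U}_\delta(x') \to 2^{\mathcal{X}}, \quad y' = \gamma(y) \mapsto \gamma\left(\partial \hat{f}(y)\right)$$

is a multi-valued mapping in a neighborhood of $x'$.

Since $\gamma$ is a homeomorphic linear map, we find that
$\gamma(\partial \hat{f}(x)) = \partial \hat{f}(x')$ is a convex and compact set.

Next we show that $\partial \hat{f}$ is upper semicontinuous at $x'$. Suppose
that $y'_i \to x'$, $g'_i \in \partial \hat{f}(y'_i)$ for each
$i \in \mathbb{N}$, and $g'$ is an accumulation point of
$(g'_i)_{i \in \mathbb{N}}$. Then there is an $i_0 \in \mathbb{N}$ such that
$y'_i \in \mathcal{U}_\delta(x')$ for all $i \geq i_0$. From

$$\mathcal{U}_\delta(x') = \mathcal{U}_\delta(\gamma(x)) = \gamma(\mathcal{U}_\delta(x))$$

follows that there are vector representations $y_i \in \mathcal{U}_\delta(x)$
with $\gamma(y_i) = y'_i$ for each $i \geq i_0$. From continuity of
$\gamma^{-1}$ follows that $y_i \to x$. By construction of $\partial \hat{f}$
follows that

$$g'_i \in \partial \hat{f}(y'_i) = \partial \hat{f}(\gamma(y_i)) = \gamma\left(\partial \hat{f}(y_i)\right)$$

for each $i \geq i_0$. Hence, there are vector representations
$g_i \in \partial \hat{f}(y_i)$ with $\gamma(g_i) = g'_i$ for each
$i \geq i_0$, and $g = \gamma^{-1}(g')$ is an accumulation point of $(g_i)$.
Since $\partial \hat{f}$ is upper semicontinuous at $x$, we find that
$g \in \partial \hat{f}(x)$. Again by construction of $\partial \hat{f}$ follows
that

$$g' = \gamma(g) \in \gamma\left(\partial \hat{f}(x)\right) = \partial \hat{f}(\gamma(x)) = \partial \hat{f}(x').$$

This proves upper semicontinuity of $\partial \hat{f}$ at all vector
representations projecting to $X = \pi(x)$.

Finally, we prove that $\hat{f}$ satisfies the subderivative property at $x'$.
Suppose that $y', y \in \mathcal{X}$ with $y' = \gamma(y)$. By
$\Gamma$-invariance of $\hat{f}$, we have $\hat{f}(y') = \hat{f}(y)$. Since
$\hat{f}$ is generalized differentiable at $x$, we find a
$g \in \partial \hat{f}(y)$ such that

$$\hat{f}(y') = \hat{f}(y) = \hat{f}(x) + \langle g, y - x \rangle + o(x, y, g)$$

with $o(x,y,g)$ tending faster to zero than $\|y - x\|$. Let $g' = \gamma(g)$.
Exploiting $\Gamma$-invariance of $\hat{f}$ as well as isometry and linearity of
$\gamma$ yields

$$\hat{f}(y') = \hat{f}(\gamma(x)) + \langle \gamma(g), \gamma(y - x) \rangle + o(x, y, g) = \hat{f}(x') + \langle g', y' - x' \rangle + o(x, y, g).$$

We define $o'(x', y', g') = o \circ \gamma^{-1}(x', y', g') = o(x, y, g)$,
showing that $o'$ tends faster to zero than $\|y' - x'\|$. This proves the
subderivative property of $\hat{f}$ at all vector representations projecting to
$X = \pi(x)$.

Putting all results together yields that $\hat{f}$ is generalized
differentiable at $\gamma(x)$ for all $\gamma \in \Gamma$. □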

B.2 Lloyd-Max Necessary Conditions for Optimality

Due to the comparably nice analytical properties of Riemannian orbifolds, the
proofs for the nearest neighbor and centroid conditions of optimal graph
quantizers are similar to their respective counterparts in vector quantization.

Theorem 3 (Nearest Neighbor Condition). Suppose that $C$ is a fixed codebook.
Any graph quantizer $Q : \mathcal{X}_\Gamma \to C$ with

$$Q(X) = \arg\min_{Y \in C} d(X, Y)$$

for all $X \in \mathcal{X}_\Gamma$, where ties are resolved according to some
rule, has minimal expected distortion.

Proof. Suppose that $Q' : \mathcal{X}_\Gamma \to C$ is a graph quantizer with
arbitrary regions. Then we have

$$d(X, Q'(X)) \geq \min_{Y \in C} d(X, Y) = d(X, Q(X))$$

for all $X \in \mathcal{X}_\Gamma$. This implies

$$D(Q') = E_X\left[d(X, Q'(X))\right] \geq E_X\left[d(X, Q(X))\right] = D(Q).$$

□

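Theorem 4 (Centroid Condition). Suppose that $\mathcal{P}_Q$ is a fixed
partition and $Q : \mathcal{X}_\Gamma \to C$ a graph quantizer with codebook $C$
satisfying

$$Y_j = \arg\min_{Y \in \mathcal{X}_\Gamma} E\left[d(X, Y) \mid X \in \mathcal{R}_j\right]$$

for all $j \in \mathcal{J}$, where the minimum is taken over all
$Y \in \mathcal{X}_\Gamma$. Then $Q$ has minimal expected distortion.

Proof. Let $p_j = P(X \in \mathcal{R}_j)$. Suppose that $Q'$ is a quantizer with
partition $\{\mathcal{R}_1, \ldots, \mathcal{R}_k\}$ and arbitrary codebook
$C' = \{Y'_1, \ldots, Y'_k\}$. Then we have

$$E\left[d(X, Q'(X))\right] = \sum_{j=1}^{k} p_j \, E\left[d(X, Q'(X)) \mid X \in \mathcal{R}_j\right]$$
$$= \sum_{j=1}^{k} p_j \, E\left[d(X, Y'_j) \mid X \in \mathcal{R}_j\right]$$
$$\geq \sum_{j=1}^{k} p_j \min_{Y \in \mathcal{X}_\Gamma} E\left[d(X, Y) \mid X \in \mathcal{R}_j\right]$$
$$= \sum_{j=1}^{k} p_j \, E\left[d(X, Y_j) \mid X \in \mathcal{R}_j\right] = E\left[d(X, Q(X))\right].$$

□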